Errors-in-variables models

In statistics and econometrics, errors-in-variables models or measurement errors models are regression models that account for measurement errors in the independent variables. In contrast, standard regression models assume that those regressors have been measured exactly, or observed without error; as such, those models account only for errors in the dependent variables, or responses.

In the case when some regressors have been measured with errors, estimation based on the standard assumption leads to inconsistent estimates, meaning that the parameter estimates do not tend to the true values even in very large samples. For simple linear regression the effect is an underestimate of the coefficient, known as the attenuation bias. In non-linear models the direction of the bias is likely to be more complicated.[1]

Motivational example

Consider a simple linear regression model of the form


    y_t = \alpha + \beta x_t^* + \varepsilon_t\,, \quad t=1,\ldots,T,

where x* denotes the true but unobserved value of the regressor. Instead we observe this value with an error:


    x_t = x^*_t + \eta_t\,,

where the measurement error ηt is assumed to be independent of the true value x*t.

If the yt's are simply regressed on the xt's (see simple linear regression), then the estimator for the slope coefficient is


    \hat\beta = \frac{\tfrac{1}{T}\sum_{t=1}^T(x_t-\bar{x})(y_t-\bar{y})}
                     {\tfrac{1}{T}\sum_{t=1}^T(x_t-\bar{x})^2}\,,

which converges as the sample size T increases without bound:


    \hat\beta\ \xrightarrow{p}\ 
      \frac{\operatorname{Cov}[\,x_t,y_t\,]}{\operatorname{Var}[\,x_t\,]}
      = \frac{\beta \sigma^2_{x^*}} {\sigma_{x^*}^2 + \sigma_\eta^2}
      = \frac{\beta} {1 + \sigma_\eta^2/\sigma_{x^*}^2}\,.

The two variances here are positive, so that in the limit the estimate is smaller in magnitude than the true value of β, an effect which statisticians call attenuation or regression dilution.[2] Thus the “naïve” least squares estimator is inconsistent in this setting. However, the estimator is a consistent estimator of the parameter required for a best linear predictor of y given x: in some applications this may be what is required, rather than an estimate of the "true" regression coefficient, although that assumes that the variance of the errors in observing x* remains fixed.
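
A short simulation makes the attenuation visible (a minimal sketch in Python; the coefficient values, variances and sample size below are arbitrary illustrative choices, not taken from any source):

    import numpy as np

    rng = np.random.default_rng(0)
    T = 200_000                                   # large sample, so the estimate is near its probability limit
    alpha, beta = 1.0, 2.0
    sigma_xstar, sigma_eta, sigma_eps = 1.0, 0.5, 1.0

    x_star = rng.normal(0.0, sigma_xstar, T)      # true but unobserved regressor
    x = x_star + rng.normal(0.0, sigma_eta, T)    # observed regressor, contaminated by measurement error
    y = alpha + beta * x_star + rng.normal(0.0, sigma_eps, T)

    # naive OLS slope of y on the mismeasured x, as in the formula above
    beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    beta_plim = beta / (1.0 + sigma_eta**2 / sigma_xstar**2)   # attenuated probability limit

    print(f"true beta        : {beta:.3f}")       # 2.000
    print(f"naive OLS slope  : {beta_hat:.3f}")   # close to 1.6, not to 2
    print(f"attenuated limit : {beta_plim:.3f}")  # 1.600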

It can be argued that almost all existing data sets contain errors of different nature and magnitude, so that attenuation bias is extremely frequent (although in multivariate regression the direction of bias is ambiguous). Jerry Hausman sees this as an iron law of econometrics: “The magnitude of the estimate is usually smaller than expected.”[3]

Specification

Usually measurement error models are described using the latent variables approach. If y is the response variable and x are observed values of the regressors, then we assume there exist some latent variables y* and x* which follow the model's “true” functional relationship g, and such that the observed quantities are their noisy observations:

\begin{cases}
  x = x^* + \eta, \\
  y = y^* + \varepsilon, \\
  y^* = g(x^*\!,w\,|\,\theta),
  \end{cases}

where θ is the model's parameter and w are those regressors which are assumed to be error-free (for example when linear regression contains an intercept, the regressor which corresponds to the constant certainly has no “measurement errors”). Depending on the specification these error-free regressors may or may not be treated separately; in the latter case it is simply assumed that corresponding entries in the variance matrix of η's are zero.

The variables y, x, w are all observed, meaning that the statistician possesses a data set of n statistical units {yi, xi, wi}i = 1, ..., n which follow the data generating process described above; the latent variables x*, y*, ε, and η are not observed however.
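
To make the notation concrete, the following sketch generates such a data set for one particular, purely illustrative choice of g (a linear function) and of the error distributions; none of these choices come from the specification itself:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 1_000
    theta = np.array([0.5, 2.0, -1.0])                 # illustrative parameter vector

    def g(x_star, w):
        # an assumed "true" functional relationship; here simply linear in x* and w
        return theta[0] + theta[1] * x_star + theta[2] * w

    x_star = rng.normal(size=n)                        # latent regressor, never observed
    w = rng.binomial(1, 0.4, size=n)                   # error-free regressor, observed exactly
    y_star = g(x_star, w)                              # latent response

    eta = rng.normal(scale=0.6, size=n)                # measurement error on the regressor
    eps = rng.normal(scale=1.0, size=n)                # measurement error on the response

    x = x_star + eta                                   # observed regressor
    y = y_star + eps                                   # observed response

    # the statistician only ever sees (y_i, x_i, w_i); x*, y*, eta and eps stay latent
    data = np.column_stack([y, x, w])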

This specification does not encompass all the existing EiV models. For example in some of them function g may be non-parametric or semi-parametric. Other approaches model the relationship between y* and x* as distributional instead of functional, that is they assume that y* conditionally on x* follows a certain (usually parametric) distribution.

Terminology and assumptions

Linear model

Linear errors-in-variables models were studied first, probably because linear models were so widely used and they are easier to analyze than non-linear ones. Unlike standard least squares regression (OLS), extending errors-in-variables regression (EiV) from the simple to the multivariate case is not straightforward.

Simple linear model

The simple linear errors-in-variables model was already presented in the “motivation” section:

\begin{cases}
    y_t = \alpha + \beta x_t^* + \varepsilon_t, \\
    x_t = x_t^* + \eta_t,
  \end{cases}

where all variables are scalar. Here α and β are the parameters of interest, whereas σε and ση (the standard deviations of the error terms) are the nuisance parameters. The “true” regressor x* is treated as a random variable (structural model), independent of the measurement error η (classical assumption).

This model is identifiable in two cases: either (1) the latent regressor x* is not normally distributed, or (2) x* has a normal distribution, but neither εt nor ηt is divisible by a normal distribution.[5] That is, the parameters α, β can be consistently estimated from the data set \scriptstyle(x_t,\,y_t)_{t=1}^T without any additional information, provided the latent regressor is not Gaussian.

Before this identifiability result was established, statisticians attempted to apply the maximum likelihood technique by assuming that all variables are normal, and then concluded that the model is not identified. The suggested remedy was to assume that some of the parameters of the model are known or can be estimated from an outside source. Such estimation methods include:[6]
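
As an illustration of this kind of remedy (a sketch, not a reproduction of any specific method from the literature), suppose the measurement-error variance σ²η is assumed known from an outside source. Subtracting it from the sample variance of x then undoes the attenuation:

    import numpy as np

    def corrected_slope(x, y, sigma_eta_sq):
        """Method-of-moments slope estimate when Var(eta) = sigma_eta_sq is assumed known.

        The sample covariance of (x, y) converges to beta * Var(x*), while the sample
        variance of x converges to Var(x*) + Var(eta); removing the known Var(eta)
        from the denominator therefore removes the attenuation factor.
        """
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        s_xy = np.mean((x - x.mean()) * (y - y.mean()))
        s_xx = np.mean((x - x.mean()) ** 2)
        return s_xy / (s_xx - sigma_eta_sq)        # the naive slope would be s_xy / s_xx

Applied to the data simulated in the motivational example above (where σ²η = 0.25), this returns a value close to the true β = 2 rather than the attenuated 1.6.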

Newer estimation methods that do not assume knowledge of some of the parameters of the model include:

Multivariate linear model

The multivariate model looks exactly like the simple linear model, only this time β, ηt, xt and x*t are k×1 vectors.

\begin{cases}
    y_t = \alpha + \beta'x_t^* + \varepsilon_t, \\
    x_t = x_t^* + \eta_t.
  \end{cases}

The general identifiability condition for this model remains an open question. It is known however that in the case when (ε,η) are independent and jointly normal, the parameter β is identified if and only if it is impossible to find a non-singular k×k block matrix [a A] (where a is a k×1 vector) such that a′x* is distributed normally and independently of A′x*.[8]
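
The earlier remark that the direction of the bias becomes ambiguous in the multivariate case can be illustrated with a small simulation (again a sketch with arbitrary illustrative values): when a mismeasured regressor is correlated with a correctly measured one, the naive OLS coefficient on the correctly measured regressor is biased as well, and its bias can go in either direction depending on the signs of the coefficients and of the correlation.

    import numpy as np

    rng = np.random.default_rng(2)
    T = 200_000
    beta1, beta2 = 2.0, 1.0                             # coefficients on x1* (mismeasured) and x2 (error-free)

    z = rng.normal(size=(T, 2))
    x1_star = z[:, 0]
    x2 = 0.7 * z[:, 0] + np.sqrt(1 - 0.7**2) * z[:, 1]  # corr(x1*, x2) = 0.7

    y = beta1 * x1_star + beta2 * x2 + rng.normal(size=T)
    x1 = x1_star + rng.normal(scale=0.8, size=T)        # only x1 is observed with error

    # naive OLS of y on the observed (1, x1, x2)
    X = np.column_stack([np.ones(T), x1, x2])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    print(b[1])   # roughly 0.89: pulled towards zero, well below beta1 = 2
    print(b[2])   # roughly 1.78: pushed upwards, away from beta2 = 1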

Some of the estimation methods for multivariate linear models are:

Non-linear models

A generic non-linear measurement error model takes the form

\begin{cases}
  y_t = g(x^*_t) + \varepsilon_t, \\
  x_t = x^*_t + \eta_t.
  \end{cases}

Here the function g can be either parametric or non-parametric. When the function g is parametric it will be written as g(x*, β).

For a general vector-valued regressor x* the conditions for model identifiability are not known. However in the case of scalar x* the model is identified unless the function g is of the “log-exponential” form [12]

g(x^*) = a + b \ln\big(e^{cx^*} + d\big)

and the latent regressor x* has density


    f_{x^*}(x) = \begin{cases}
               A e^{-Be^{Cx}+CDx}(e^{Cx}+E)^{-F}, & \text{if}\ d>0 \\
               A e^{-Bx^2 + Cx}, & \text{if}\ d=0
             \end{cases}

where constants A,B,C,D,E,F may depend on a,b,c,d.

Despite this optimistic result, as of now no methods exist for estimating non-linear errors-in-variables models without any extraneous information. However, there are several techniques which make use of some additional data: either instrumental variables or repeated observations.

Instrumental variables methods

Repeated observations

In this approach, two (or possibly more) repeated observations of the regressor x* are available. Both observations contain their own measurement errors; however, those errors are required to be independent:

\begin{cases}
    x_{1t} = x^*_t + \eta_{1t}, \\
    x_{2t} = x^*_t + \eta_{2t},
  \end{cases}

where x* ⊥ η1 ⊥ η2 (that is, x*, η1 and η2 are mutually independent). Variables η1, η2 need not be identically distributed (although if they are, the efficiency of the estimator can be slightly improved). With only these two observations it is possible to consistently estimate the density function of x* using Kotlarski’s deconvolution technique.[14]
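
A rough numerical sketch of this idea is given below (an illustrative implementation, not the estimator of the cited papers). It uses the empirical analogue of Kotlarski's identity, φx*(t) = exp( ∫0..t E[i x1 e^{i s x2}] / E[e^{i s x2}] ds ), which additionally assumes E[η1] = 0, followed by a truncated inverse Fourier transform of the recovered characteristic function:

    import numpy as np

    def kotlarski_density(x1, x2, t_max=6.0, n_t=400, grid=None):
        """Estimate the density of the latent x* from two independently mismeasured copies.

        Assumes x*, eta1, eta2 are mutually independent and E[eta1] = 0.
        """
        x1 = np.asarray(x1, dtype=float)
        x2 = np.asarray(x2, dtype=float)
        t = np.linspace(0.0, t_max, n_t)

        # empirical characteristic-function quantities on the grid of s values
        e_isx2 = np.exp(1j * t[None, :] * x2[:, None])       # shape (n, n_t)
        num = np.mean(1j * x1[:, None] * e_isx2, axis=0)     # estimates E[i x1 exp(i s x2)]
        den = np.mean(e_isx2, axis=0)                        # estimates E[exp(i s x2)]
        ratio = num / den                                     # -> d/ds log phi_{x*}(s)

        # cumulative trapezoid integral of the ratio gives log phi_{x*}(t); phi(0) = 1
        log_phi = np.concatenate([[0.0],
                                  np.cumsum((ratio[1:] + ratio[:-1]) / 2 * np.diff(t))])
        phi = np.exp(log_phi)

        # truncated inverse Fourier transform; since x* is real, phi(-t) = conj(phi(t)),
        # so integrating over [0, t_max] and taking the real part covers both half-lines
        if grid is None:
            grid = np.linspace(x1.min(), x1.max(), 200)
        integrand = np.real(np.exp(-1j * np.outer(grid, t)) * phi[None, :])
        density = np.trapz(integrand, t, axis=1) / np.pi
        return grid, np.clip(density, 0.0, None)

The truncation point t_max acts as a smoothing parameter; in practice it must be chosen with care, because the empirical characteristic function in the denominator becomes unreliable at high frequencies.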

Further reading

Notes

  1. ^ Griliches & Ringstad 1970, Chesher 1991
  2. ^ Greene 2003, Chapter 5.6.1
  3. ^ Hausman 2001, p. 58
  4. ^ Fuller 1987, p. 2
  5. ^ Reiersøl 1950, p. 383. A somewhat more restrictive result was established earlier by R. C. Geary in “Inherent relations between random variables”, Proceedings of Royal Irish Academy, vol.47 (1950). He showed that under the additional assumption that (ε, η) are jointly normal, the model is not identified if and only if x*’s are normal.
  6. ^ Fuller 1987, ch. 1
  7. ^ Pal 1980, §6
  8. ^ Bekker 1986. An earlier proof by Y. Willassen in “Extension of some results by Reiersøl to multivariate models”, Scand. J. Statistics, 6(2) (1979) contained errors.
  9. ^ Dagenais & Dagenais 1997. In the earlier paper (Pal 1980) considered a simpler case when all components in vector (ε, η) are independent and symmetrically distributed.
  10. ^ Fuller 1987, p. 184
  11. ^ Erickson & Whited 2002
  12. ^ Schennach, Hu & Lewbel 2007
  13. ^ Newey 2001
  14. ^ Li & Vuong 1998
  15. ^ Li 2002
  16. ^ Schennach 2004a
  17. ^ Schennach 2004b

References